Week 1 - Start me up!

I came to know about the Introduction to Open Data Science course from a friend and immediately signed up for it. I think this course will be a great learning experience for me.

You can find my GitHub repository here.


Week 2 - Regression and model validation

Part 1

Step 1 - Reading the data from the local folder, as prepared in the data wrangling exercise. You can find the wrangling code in the data folder.

We use head() to check that the data was loaded correctly.

data <- read.csv(file = "data/learning2014.csv", row.names = 1)
head(data)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21

Step 2 - Exploring the dimensions and basic structure of the data using dim() and str() respectively.

dim(data)
## [1] 166   7

The output of dim() shows that there are 166 observations and 7 variables in the analysis dataset.

str(data)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The output of str() confirms the 166 observations of 7 variables and shows the details of each variable, including its name, type and a few values. The analysis dataset was created using the metadata of JYTOPKYS3. The original data were collected in 2014-2015 via surveys and contained 60 variables.

The idea of this analysis is to study three learning approaches/strategies: deep, strategic and surface.

Below is a brief overview of the variables:

  • gender: a factor with two levels: “F” and “M”.
  • age: age in years, derived from the date of birth.
  • attitude: the mean of 10 survey questions measuring students’ attitude towards statistics.
  • deep: the mean of the survey questions related to the deep learning approach.
  • stra: the mean of the survey questions related to the strategic learning approach.
  • surf: the mean of the survey questions related to the surface learning approach.
  • points: the students’ scores on the statistics course exam.

Part 2

We use pairs() (excluding the gender column) and ggpairs() to visualize scatter plot matrices of the variables.

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
pairs(data[-1])

p <- ggpairs(data, lower = list(combo = wrap("facethist", bins = 20)))
p

Using this scatter plot matrix we can observe the distribution of each variable and the relationships between pairs of variables. For example, the distribution of age shows that most of the participants are young (under 30). We can also spot relationships: points and attitude show a clear positive trend, with points increasing as attitude increases.

We can also analyze individual distributions in detail using hist().

par(mfrow=c(2,3))
hist(data$age, prob=T, main="age")
hist(data$attitude, prob=T, main="attitude")
hist(data$points, prob=T, main="points")
hist(data$deep, prob=T, main="deep")
hist(data$stra, prob=T, main="stra")
hist(data$surf, prob=T, main="surf")

Another graphical summary tool is the boxplot. One meaningful way to use it here is to visualize the scores of the three learning approaches for a quick comparison.

boxplot(data[,4:6]) 

Part 3

Explanatory variables chosen: attitude, age and deep.

my_model <- lm(points ~ attitude + age + deep, data)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + age + deep, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.206  -3.430   0.204   3.979  10.950 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.60773    3.38966   4.605 8.32e-06 ***
## attitude     3.59413    0.56959   6.310 2.56e-09 ***
## age         -0.07716    0.05322  -1.450    0.149    
## deep        -0.60275    0.75031  -0.803    0.423    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.307 on 162 degrees of freedom
## Multiple R-squared:  0.2043, Adjusted R-squared:  0.1896 
## F-statistic: 13.87 on 3 and 162 DF,  p-value: 4.305e-08

Looking at the p-value of the F-statistic and the t-statistics in the coefficient table, there is a significant association between points and attitude. In comparison, age and the deep learning strategy have much weaker associations with the points earned in the exam.

We can check the confidence intervals of the model coefficients using confint().

confint(my_model)
##                  2.5 %      97.5 %
## (Intercept)  8.9141112 22.30135604
## attitude     2.4693631  4.71890284
## age         -0.1822595  0.02794462
## deep        -2.0843976  0.87889331

We can remove one of the non-significant variables, age, and repeat the process.

my_model <- lm(points ~ attitude + deep, data)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + deep, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.8463  -3.2975   0.2789   4.1213  11.1399 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.7812     3.1574   4.365 2.25e-05 ***
## attitude      3.5780     0.5714   6.262 3.25e-09 ***
## deep         -0.6275     0.7526  -0.834    0.406    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.325 on 163 degrees of freedom
## Multiple R-squared:  0.194,  Adjusted R-squared:  0.1841 
## F-statistic: 19.62 on 2 and 163 DF,  p-value: 2.326e-08
ggpairs(data, lower = list(combo = wrap("facethist", bins = 20)))

Part 4

Analysis: From the summary and charts shown in Part 3, the following observations can be made:

  • Attitude has the most significant relation with points compared to the other two explanatory variables.
  • Because this is multiple regression, R2 increases whenever variables are added, so we consider the adjusted R2 instead. Its value (around 0.18) suggests that although attitude is a significant factor, it is not the only factor influencing points.
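As an illustrative sketch of why the adjusted R2 matters (using the built-in mtcars data, not the course data): multiple R2 can only grow when predictors are added, while adjusted R2 applies a per-predictor penalty.

```r
# Multiple R^2 never decreases when predictors are added;
# adjusted R^2 penalizes each extra predictor (built-in mtcars data)
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + gear + carb, data = mtcars)  # two extra predictors
summary(m1)$r.squared; summary(m2)$r.squared         # the second is never smaller
summary(m1)$adj.r.squared; summary(m2)$adj.r.squared # the second may shrink
```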

Part 5

my_model <- lm(points ~ attitude + age + deep, data)
par(mfrow=c(2,2))
plot(my_model, which=c(1,2,5))

Model assumptions: the relationship between the target and the explanatory variables is linear, and the errors are normally distributed with constant variance.

Observations:

  1. Residuals vs Fitted (1): the residuals are spread fairly evenly around zero with no obvious pattern, supporting the linearity and constant-variance assumptions.
  2. Normal Q-Q (2): the points follow the line closely, indicating that the errors are approximately normally distributed, with clear deviations only at the very low and very high ends.
  3. Residuals vs Leverage: no observation stands out with unusually high leverage, so no single outlier has an outsized influence on the model.

Week 3 - Logistic Regression - Analysis

Part 2

alc <- read.csv( file="data/alc.csv", sep=",")
dim(alc)
## [1] 382  35
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

This data is combined from two datasets containing student performance questionnaire responses from math and Portuguese classes. The joined data (of students who answered both questionnaires) has 382 observations of 35 variables; the variable names can be seen in the output above. Details on the original questionnaire variables can be found here. The additional attributes added are:

  • alc_use: the average of workday and weekend alcohol use levels (Dalc and Walc).
  • high_use: a logical column, TRUE when alc_use > 2 and FALSE otherwise.
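As a minimal sketch of how these derived columns are computed (on a hypothetical toy data frame; the actual wrangling code lives in the data folder):

```r
# Sketch of the wrangling step: average the two alcohol-use columns,
# then flag averages above 2 (toy data, not the real dataset)
df <- data.frame(Dalc = c(1, 3, 5), Walc = c(2, 4, 5))
df$alc_use  <- (df$Dalc + df$Walc) / 2
df$high_use <- df$alc_use > 2
df
```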

Part 3

The four variables I find interesting in relation to alcohol consumption for my analysis are:

  • famrel: quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  • studytime: weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  • freetime: free time after school (numeric: from 1 - very low to 5 - very high)
  • goout: going out with friends (numeric: from 1 - very low to 5 - very high)

These variables are chosen to see the overall impact of time use on alcohol consumption. My hypothesis is that if students have more unstructured time (higher values of freetime and goout) and less structured time (time spent with family and studying), their alcohol consumption will be higher.

Part 4

Let’s first see the distributions:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
alc_selected <- select(alc, one_of(c("freetime", "famrel", "studytime", "goout")))
gather(alc_selected) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

The distributions are skewed. The skewness can be interpreted as:

  • Family relations are strong for most students, so most of them probably spend more time with family.
  • Most students have medium or medium-high free time; only a small proportion have very low free time (1).
  • Most students go out with friends with medium or medium-low frequency.
  • Most students spend 2-5 hours per week studying.

qplot(alc_use, studytime, data = alc) + geom_smooth(method = "lm")

g1 <- ggplot(alc, aes(x = high_use, y = studytime))
g1 + geom_boxplot() + ylab("studytime")

Interpretation of studytime: Alcohol use increases as the studytime decreases.

qplot(alc_use, famrel, data = alc) + geom_smooth(method = "lm")

g1 <- ggplot(alc, aes(x = high_use, y = famrel))
g1 + geom_boxplot() + ylab("fam rel")

Interpretation of famrel: Alcohol use increases as the quality of family relationships decreases.

qplot(alc_use, goout, data = alc) + geom_smooth(method = "lm")

g1 <- ggplot(alc, aes(x = high_use, y = goout))
g1 + geom_boxplot() + ylab("go out")

Interpretation of goout: Alcohol use increases as the going out with friends activity increases.

qplot(alc_use, freetime, data = alc) + geom_smooth(method = "lm")

g1 <- ggplot(alc, aes(x = high_use, y = freetime))
g1 + geom_boxplot() + ylab("free time")

Interpretation of freetime: Alcohol use increases as freetime increases, but the trend is weak.

Part 5


m <- glm(high_use ~ studytime + famrel + freetime + goout, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ studytime + famrel + freetime + goout, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8641  -0.7799  -0.5462   0.8654   2.6080  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.1360     0.7523  -1.510 0.131010    
## studytime    -0.5805     0.1659  -3.498 0.000468 ***
## famrel       -0.3732     0.1372  -2.720 0.006537 ** 
## freetime      0.1273     0.1358   0.938 0.348413    
## goout         0.7487     0.1227   6.100 1.06e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 393.03  on 377  degrees of freedom
## AIC: 403.03
## 
## Number of Fisher Scoring iterations: 4

The summary shows that studytime (-0.5805), famrel (-0.3732) and goout (0.7487) are significant variables. In contrast, freetime is not significant, consistent with what was observed in Part 4.

The negative coefficients of studytime and famrel indicate that higher values are associated with lower odds of high alcohol consumption, whereas the positive coefficients of goout and freetime indicate that higher values go together with higher odds of high consumption. A plausible explanation is that students who go out with friends more often also tend to consume more alcohol.

So, my hypothesis is supported for studytime, famrel and goout, but only weakly for freetime.

OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.3211022 0.07177448 1.3833352
## studytime   0.5596279 0.40001439 0.7681441
## famrel      0.6885197 0.52460597 0.9003017
## freetime    1.1357556 0.87062313 1.4846514
## goout       2.1142144 1.67292633 2.7096221

The odds ratios support this: goout (OR 2.11) is clearly associated with higher odds of high alcohol consumption, while studytime (OR 0.56) and famrel (OR 0.69) are associated with lower odds. The 95% confidence intervals of these three exclude 1, so the associations are statistically significant. For freetime the interval (0.87-1.48) includes 1, so its association is not significant.
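As a quick sanity check of the relationship between a logistic regression coefficient and its odds ratio (the coefficient value below is copied from the model summary above):

```r
# An odds ratio is simply the exponential of the log-odds coefficient
b_goout <- 0.7487   # goout coefficient from summary(m) above
exp(b_goout)        # ~2.11, matching the OR column
```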

Part 6

probabilities <- predict(m, type = "response")

alc <- mutate(alc, probability = probabilities)

alc <- mutate(alc, prediction = probability > 0.5)

prediction_results <- table(high_use = alc$high_use, prediction = alc$prediction)
prediction_results
##         prediction
## high_use FALSE TRUE
##    FALSE   246   22
##    TRUE     72   42

Graphical representation:

g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()

results <- table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins

results
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64397906 0.05759162 0.70157068
##    TRUE  0.18848168 0.10994764 0.29842932
##    Sum   0.83246073 0.16753927 1.00000000

These results show that 0.7539, i.e. about 75% (64.40% FALSE and 10.99% TRUE), of the observations are correctly predicted, and approximately 25% (5.76% TRUE and 18.85% FALSE) are incorrectly predicted. The predictive power is therefore better than random guessing, although note that always predicting FALSE would already be correct about 70% of the time, since roughly 70% of the students are low users.
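The same numbers can be recomputed directly from the confusion matrix above; the majority-class baseline of always predicting FALSE is a fairer comparison than a 50/50 coin flip:

```r
# Rebuild the confusion matrix shown above and compare the model's accuracy
# to the majority-class baseline (always predicting FALSE)
tab <- matrix(c(246, 72, 22, 42), nrow = 2,
              dimnames = list(high_use   = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))
accuracy <- sum(diag(tab)) / sum(tab)     # (246 + 42) / 382 ~ 0.754
baseline <- max(rowSums(tab)) / sum(tab)  # 268 / 382 ~ 0.702
```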


Week 4 - Clustering and classification

Part 2

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

This data has 506 observations of 14 variables. The variables are numeric (with types num and int). The variables provide information about housing values in Boston. Details about individual variables can be found here.

Part 3

# Summary
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The summary shows the min, max, mean and quartile values of all the variables, revealing that the variables have quite different distributions. Some interesting observations:

  • The mean of age is 68.57, meaning that on average 68.57% of the owner-occupied units in a town were built prior to 1940.
  • The mean of rm is 6.285, i.e. the average number of rooms per dwelling is greater than 6.

# Graphical overview
pairs(Boston)

The plot matrix shows relationships between pairs of variables. Some observations include visible relationships between “nox and age” and between “lstat and medv”. This matrix can be used to identify possible relations between variables and their strengths for further exploration.

Part 4

# Standardize the dataset
boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# Convert the scaled matrix back to a data frame
boston_scaled <- as.data.frame(boston_scaled)

The summary shows that all variables are now on the same scale, each with mean 0 and standard deviation 1.
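A quick check of what scale() does to a single column (it subtracts the column mean and divides by the standard deviation):

```r
# scale() standardizes column-wise: (x - mean(x)) / sd(x)
x <- c(1, 2, 3, 10)
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))  # TRUE
```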

# Create a new variable
brk <- quantile(boston_scaled$crim)
l <- c('low','med_low','med_high','high')
crimeR <- cut(boston_scaled$crim, breaks = brk, include.lowest = TRUE, labels = l)

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crimeR)

summary(boston_scaled$crimeR)
##      low  med_low med_high     high 
##      127      126      126      127

The categorical variable “crimeR” was created and “crim” removed from the data as per the instructions.

# Divide data into train (80%) and test (20%) sets
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)  # random split, so results vary between runs
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Part 5

# Fit the linear discriminant analysis on the train set.
lda.fit <- lda(crimeR ~ ., data = train)
classes <- as.numeric(train$crimeR)
plot(lda.fit, dimen = 2,col=classes,pch=classes)

Part 6

# Save the crime categories
crimeCat <- test[,"crimeR"]
# Remove the crime variable
test_data <- dplyr::select(test, -crimeR)
# Predict categories for the test set
lda.pred <- predict(lda.fit, newdata = test_data)
# Cross tabulate
table(correct = crimeCat, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11      13        2    0
##   med_low    2      15        5    0
##   med_high   0      12       14    1
##   high       0       0        0   27

This shows that in most cases the crime rate category was predicted correctly. The prediction was especially accurate for the most sensitive category, “high”: all 27 high observations were classified correctly, and only one med_high observation was misclassified as high. To check the overall prediction accuracy we can use:

tabular <- table(correct = crimeCat, predicted = lda.pred$class)
sum(diag(tabular))/sum(tabular)
## [1] 0.6568627

Part 7

# Reload Boston data
library(MASS)
data('Boston')
# scale variables
boston_scaled <- scale(Boston)
# calculate distance between observations
dist_eu <- dist(boston_scaled)
# summarize the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# Run the k-means algorithm on the scaled dataset and visualize the clusters.
# The number of centers was chosen by visually inspecting the pairs plots.
km <- kmeans(boston_scaled, centers = 2)
pairs(Boston, col = km$cluster)

Different numbers of centers were tried in the previous chunk to see the impact on the pairs plot. Based on those observations, a cluster count of 2 seems reasonable.
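A more systematic way to choose the number of centers is to plot the total within-cluster sum of squares (WCSS) over a range of k and look for an “elbow”; a sketch of that idea on the scaled Boston data:

```r
library(MASS)

# Elbow method: total WCSS for k = 1..10 on the scaled Boston data
data("Boston")
boston_scaled <- scale(Boston)
set.seed(123)  # k-means uses random starts, so fix the seed for reproducibility
twcss <- sapply(1:10, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "number of clusters k",
     ylab = "total WCSS")  # a sharp bend ("elbow") suggests a good k
```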


Week 5 - Dimensionality reduction techniques

library(MASS)
library(ggplot2)
library(GGally)
library(dplyr)
library(stringr)

human <- read.table("data/human.csv", header = TRUE, sep = ",", row.names = 1)

Part 1 - Data overview

str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

This data has 155 observations of 8 variables (the original dataset had 195 observations of 19 variables before the wrangling step). Each row represents one country, given as the row name. The variables are indicators related to health, education and employment, on which countries are ranked.

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Observations: The min and max values of the selected indicators differ greatly from their means, suggesting large differences between developed and developing countries.

ggpairs(human)

The distributions differ between variables, and most of them are skewed. Looking at the relationships, there are strong correlations between the variable pairs:

  • Edu.Exp and Life.Exp
  • Mat.Mor and Life.Exp
  • Mat.Mor and Edu.Exp
  • Ado.Birth and Edu.Exp
  • Ado.Birth and Mat.Mor

GNI has its strongest positive correlations with Edu.Exp and Life.Exp and its strongest negative correlations with Ado.Birth and Mat.Mor.

Part 2 - Perform PCA

pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
biplot(pca_human, col = c("grey", "blue"), main = "Non-standardized variables")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

This plot is difficult to read and makes little sense: without standardization, GNI has a far larger variance than the other variables, so it dominates PC1 and the remaining variables collapse together.
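prcomp() operates on variances, so a variable on a much larger scale swallows PC1; a small synthetic illustration (hypothetical data, not the human dataset):

```r
# PCA without scaling: the high-variance column dominates PC1
set.seed(1)
x <- data.frame(a = rnorm(100), b = rnorm(100), big = rnorm(100, sd = 1000))
summary(prcomp(x))$importance[2, 1]                 # PC1 carries ~all the variance
summary(prcomp(x, scale. = TRUE))$importance[2, 1]  # scaling spreads it out
```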

Part 3 - Perform PCA on standardized data.

human_scaled <- scale(human)
pca_human_scaled <- prcomp(human_scaled)
summary(pca_human_scaled)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
biplot(pca_human_scaled, col = c("grey", "blue"), main = "Standardized variables")

The results are clearly different. Scaling gives every variable mean 0 and standard deviation 1, so no single variable dominates the components by virtue of its units. The biplot is unreadable for the non-standardized variables but much more readable and neat for the standardized ones.

Part 4 - interpretations of the first two principal component dimensions.

Based on the previous parts, my interpretation is that Edu.Exp, Life.Exp, Edu2.FM and GNI are strongly positively correlated with each other. Mat.Mor and Ado.Birth are highly positively correlated with each other but point in the opposite direction, i.e. they are strongly negatively related to the first group. Parli.F and Labo.FM are positively correlated with each other but not strongly with the other variables; they are the main contributors to PC2, while all the other variables contribute mainly to PC1.
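This interpretation can be checked from the loadings (the rotation matrix). Shown here on the built-in USArrests data as a stand-in; the same columns of pca_human_scaled$rotation would be inspected for the human data:

```r
# The rotation matrix shows each variable's contribution to each PC
pca <- prcomp(scale(USArrests))
round(pca$rotation[, 1:2], 2)     # loadings of PC1 and PC2
summary(pca)$importance[2, 1:2]   # proportion of variance explained by PC1, PC2
```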

Part 5 - tea dataset from the FactoMineR package.

library(FactoMineR)
data("tea")
dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

This data set describes the tea consumption habits of 300 respondents. As the full structure is large (36 variables), let’s select a few variables for the analysis:

cols_tea <- c("Tea", "How", "how","sugar", "frequency", "healthy")
tea <- dplyr::select(tea, one_of(cols_tea))
str(tea)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea      : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How      : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how      : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar    : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ frequency: Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ healthy  : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
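
Before running MCA, it can help to glance at the category counts of the selected variables. A quick sketch using tidyr's gather() and ggplot2 (assuming both packages are installed; gather() has since been superseded by pivot_longer(), but was the idiom at the time):

```r
library(tidyr)
library(ggplot2)
library(dplyr)

# Gather all six factor columns into key/value pairs and draw one
# bar chart per variable to see how the categories are distributed.
gather(tea) %>%
  ggplot(aes(value)) +
  geom_bar() +
  facet_wrap("key", scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```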

Multiple correspondence analysis (MCA):

library(FactoMineR)
mca <- MCA(tea, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.248   0.220   0.194   0.188   0.174   0.166
## % of var.             12.398  11.013   9.685   9.405   8.684   8.314
## Cumulative % of var.  12.398  23.411  33.096  42.501  51.185  59.499
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.164   0.152   0.138   0.132   0.119   0.104
## % of var.              8.222   7.619   6.910   6.608   5.926   5.215
## Cumulative % of var.  67.721  75.340  82.250  88.858  94.785 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.026  0.001  0.001 | -0.314  0.149  0.074 |  0.106
## 2                  |  0.343  0.158  0.064 | -0.669  0.678  0.242 |  0.219
## 3                  |  0.241  0.078  0.076 | -0.159  0.038  0.033 | -0.214
## 4                  | -0.501  0.337  0.273 | -0.444  0.299  0.215 | -0.164
## 5                  | -0.022  0.001  0.000 |  0.244  0.090  0.055 | -0.175
## 6                  | -0.069  0.006  0.005 | -0.420  0.267  0.197 | -0.382
## 7                  | -0.034  0.002  0.001 |  0.371  0.208  0.075 | -0.228
## 8                  |  0.194  0.051  0.015 | -0.197  0.059  0.016 |  0.496
## 9                  |  0.169  0.038  0.015 | -0.032  0.002  0.001 |  0.132
## 10                 |  0.968  1.259  0.660 |  0.076  0.009  0.004 |  0.033
##                       ctr   cos2  
## 1                   0.019  0.008 |
## 2                   0.082  0.026 |
## 3                   0.079  0.060 |
## 4                   0.046  0.029 |
## 5                   0.053  0.028 |
## 6                   0.251  0.162 |
## 7                   0.090  0.028 |
## 8                   0.423  0.100 |
## 9                   0.030  0.009 |
## 10                  0.002  0.001 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.982  15.979   0.316   9.713 |   0.298   1.654
## Earl Grey          |  -0.435   8.173   0.341 -10.096 |  -0.069   0.233
## green              |   0.341   0.861   0.014   2.074 |  -0.263   0.575
## alone              |   0.045   0.088   0.004   1.055 |   0.263   3.409
## lemon              |  -0.518   1.980   0.033  -3.146 |   0.234   0.458
## milk               |  -0.141   0.279   0.005  -1.253 |  -0.806  10.310
## other              |   1.912   7.369   0.113   5.813 |  -0.925   1.943
## tea bag            |  -0.336   4.308   0.148  -6.650 |  -0.311   4.157
## tea bag+unpackaged |   0.419   3.696   0.080   4.893 |  -0.017   0.007
## unpackaged         |   0.494   1.970   0.033   3.156 |   1.516  20.856
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.029   2.946 |   0.740  11.621   0.179   7.321 |
## Earl Grey            0.009  -1.607 |   0.026   0.037   0.001   0.598 |
## green                0.009  -1.598 |  -1.810  31.002   0.405 -11.002 |
## alone                0.129   6.204 |  -0.425  10.124   0.336 -10.025 |
## lemon                0.007   1.425 |   0.832   6.552   0.086   5.058 |
## milk                 0.172  -7.181 |   0.446   3.589   0.053   3.973 |
## other                0.026  -2.813 |   3.048  23.977   0.287   9.268 |
## tea bag              0.127  -6.157 |   0.041   0.082   0.002   0.811 |
## tea bag+unpackaged   0.000  -0.202 |  -0.019   0.009   0.000  -0.219 |
## unpackaged           0.313   9.677 |  -0.145   0.216   0.003  -0.924 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.372 0.033 0.496 |
## How                | 0.145 0.213 0.514 |
## how                | 0.148 0.331 0.004 |
## sugar              | 0.415 0.001 0.082 |
## frequency          | 0.279 0.475 0.064 |
## healthy            | 0.129 0.270 0.002 |

The first dimension explains about 12.4% of the variance in the data and the second about 11.0%, which are quite low figures. Judging by the contributions in the categories table, black and Earl Grey contribute most to the first dimension, while unpackaged and milk dominate the second.
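
The variance figures quoted above can also be read directly from the MCA object; as a quick check (assuming the mca object fitted above):

```r
# FactoMineR stores the eigenvalue, percentage of variance and the
# cumulative percentage for each dimension in mca$eig.
head(mca$eig, 3)
```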

plot(mca, invisible = c("ind"), habillage = "quali")

plot(mca, invisible = c("var"), habillage = "quali")

Some patterns can be seen here. Looking at the distances between categories, for example: drinking 1 to 2 cups of tea per day is associated with considering tea healthy; green tea sits closer to the healthy category than black or Earl Grey; black tea is mostly taken without sugar, whereas Earl Grey is likely to be taken with sugar; tea with milk is associated with drinking only one cup per day; and a low frequency (1 to 2 per week) lies close to the Not.healthy category.
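
Reading associations off the biplot is somewhat subjective; FactoMineR's dimdesc() gives a numeric complement by listing the variables and categories most associated with each dimension (a sketch, assuming the mca object from above):

```r
# Describe the first two MCA dimensions: which variables (eta2) and
# categories (coordinates with p-values) characterize each axis.
dimdesc(mca, axes = 1:2)
```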


Week 6 - Analysis of longitudinal data

Part 1 - Implement the analyses of Chapter 8 of MABS using the RATS data.

Read the data:

library(dplyr)
library(tidyr)
library(ggplot2)
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(GGally)
RATS <- read.csv(file = "~/IODS-project/data/RATS.csv", header = TRUE, row.names = 1)
str(RATS)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...

Factor the categorical variables:

RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
str(RATS)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
ggplot(RATS, aes(x = Time, y = weight, color=Group, group = ID)) +
  geom_line(aes(linetype = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Weight (grams)")

ggplot(RATS, aes(x = Time, y = weight, color=Group, group = ID)) +
  geom_line(aes(linetype = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Weight (grams)") +
  facet_grid(. ~ Group, labeller = label_both)

The graphs show the weight development of the three groups of rats over the roughly 60-day study. Groups 2 and 3 start at clearly higher weights than group 1, and their tendency to gain weight over time also appears stronger.
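
The visual impression can be backed up with a small summary table comparing each group's mean weight at the first and last measurement days (a sketch, assuming the RATS data frame and dplyr from above):

```r
# Mean weight per group at the first and last measurement occasions,
# plus the number of observations behind each mean.
RATS %>%
  filter(Time %in% c(min(Time), max(Time))) %>%
  group_by(Group, Time) %>%
  summarise(mean_weight = mean(weight), n = n())
```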

Let’s standardize the weight variable:

library(ggplot2)
library(dplyr) 
library(tidyr)
library(lme4)
RATS_1 <- RATS %>%
  group_by(Time) %>%
  mutate(stdrats = (weight - mean(weight))/sd(weight) ) %>%
  ungroup()
glimpse(RATS_1)
## Observations: 176
## Variables: 6
## $ ID      <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ Group   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1,…
## $ WD      <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, W…
## $ weight  <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 5…
## $ Time    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8,…
## $ stdrats <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.88190…

Plot using the standardized data:

ggplot(RATS_1, aes(x = Time, y = stdrats, color=ID, group = ID)) +
  geom_line(aes(linetype = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Standardized weight")

ggplot(RATS_1, aes(x = Time, y = stdrats, color=Group, group = ID)) +
  geom_line(aes(linetype = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Standardized weight") +
  facet_grid(. ~ Group, labeller = label_both)

The standardized profiles make the individual growth trajectories comparable, and they still show a clear separation between the three groups with respect to weight.

Summary graph using mean profiles:

n <- RATS$Time %>% unique() %>% length()
RATS_2 <- RATS_1 %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(weight), se = sd(weight)/sqrt(n) ) %>%
  ungroup()
ggplot(RATS_2, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1, 2, 3)) +
  geom_point(size = 3) +
  scale_shape_manual(values = c(1, 2, 3)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.3) +
  theme(legend.position = c(0.8, 0.8)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

This graph shows the mean response profile of each group with standard error bars. It summarizes each group's repeated measurements into a single profile, which makes comparing the groups over the study period easier.

We can also check the impact of outliers in the study.

RATS_3 <- RATS_1 %>%
  filter(Time > 0) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(weight) ) %>%
  ungroup()
ggplot(RATS_3, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "blue") +
  scale_y_continuous(name = "mean(Weight)")

There are outliers in the data, which can be seen clearly here. The most prominent outlier is in group 2; we can remove it for a cleaner analysis.

RATS_4 <- RATS_3 %>% filter(mean < 570)
oneway.test(mean ~ Group, data = RATS_4, var.equal = TRUE)
## 
##  One-way analysis of means
## 
## data:  mean and Group
## F = 483.6, num df = 2, denom df = 12, p-value = 3.387e-12

Using the results of the ANOVA above (F = 483.6 with 2 and 12 degrees of freedom, p < 0.001), it can be concluded that the mean weights of the three groups differ significantly.
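
Chapter 8 of MABS also adjusts the group comparison for the baseline measurement. A sketch of that step, assuming the RATS and RATS_4 objects from above (the joined column name baseline is my own choice):

```r
# Take each rat's day-1 weight as the baseline covariate and add it
# to the summary data, then fit an ANCOVA: mean weight explained by
# baseline weight and group.
baseline <- RATS %>% filter(Time == 1)
RATS_5 <- RATS_4 %>%
  mutate(baseline = baseline$weight[match(ID, baseline$ID)])
fit <- lm(mean ~ baseline + Group, data = RATS_5)
anova(fit)
```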

Part 2 - Implement the analyses of Chapter 9 of MABS using the BPRS data.

Read the data:

BPRS <- read.csv(file = "~/IODS-project/data/BPRS.csv", header = TRUE, row.names = 1)
str(BPRS)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...

Factor the categorical variables and convert weeks from factor to character.

BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRS$weeks <- as.character(BPRS$weeks)
str(BPRS)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...

Standardize the data:

BPRS <- BPRS %>%
  group_by(week) %>%
  mutate(stdbprs = (bprs - mean(bprs))/sd(bprs) ) %>%
  ungroup()
str(BPRS)
## Classes 'tbl_df', 'tbl' and 'data.frame':    360 obs. of  6 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stdbprs  : num  -0.425 0.708 0.425 0.495 1.698 ...
ggpairs(BPRS[,c(1,4:5)])
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Plotting the data:

ggplot(BPRS, aes(x = week, y = bprs, color = subject)) +
  geom_line(aes(linetype = subject)) +
  scale_x_continuous(name = "Week", breaks = seq(0, 8, 1)) +
  scale_y_continuous(name = "bprs") +
  theme(legend.position = "top")

Plotting the data for both treatment types:

ggplot(BPRS, aes(x = week, y = bprs, color = subject)) +
  geom_line() +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_y_continuous(limits = c(min(BPRS$bprs), max(BPRS$bprs))) +
  theme(legend.position = "top")

ggplot(BPRS, aes(x = week, y = bprs, color = treatment)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "week", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "bprs") +
  theme(legend.position = "top")

Using multiple regression model:

BPRS_1 <- lm(bprs ~ week + treatment, data = BPRS)
summary(BPRS_1)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

We can see that week (time) is highly significant while the treatment type is not (p = 0.661). Note, however, that this model treats the repeated measurements on the same subject as independent observations.

Using the random intercept model:

BPRS_2 <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)
summary(BPRS_2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

The result shows that the estimated variance of the subject random intercepts (47.4) is considerable, indicating substantial between-subject variability. The fixed-effect estimates are unchanged, but the standard error of treatment2 is smaller than in the ordinary regression model; the treatment effect is still not significant.
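
The non-significance of the treatment effect can be made explicit with confidence intervals for the fixed effects. A sketch using lme4's confint() method (profiling can be slow on larger models; assumes the BPRS_2 object from above):

```r
# Profile confidence intervals for the fixed effects only
# ("beta_" is lme4's shorthand for the fixed-effect parameters).
confint(BPRS_2, parm = "beta_", method = "profile")
```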

Using the random intercept and random slope model:

BPRS_3 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)
summary(BPRS_3)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7977 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9803   -0.51
##  Residual             97.4304  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

Using anova to compare models:

anova(BPRS_3, BPRS_2)
## Data: BPRS
## Models:
## BPRS_2: bprs ~ week + treatment + (1 | subject)
## BPRS_3: bprs ~ week + treatment + (week | subject)
##        Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_2  5 2748.7 2768.1 -1369.4   2738.7                           
## BPRS_3  7 2745.4 2772.6 -1365.7   2731.4 7.2721      2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The likelihood ratio test (χ² = 7.27, p = 0.026) shows that the random intercept and random slope model fits the data significantly better than the random intercept model, so allowing the subjects to have individual slopes over time improves the fit.
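
Following Chapter 9 of MABS, the fitted values from the better model can be plotted against week and compared with the observed profiles (a sketch, assuming the BPRS and BPRS_3 objects from above; the fitted column name is my own choice):

```r
# Add the model's fitted values to the data and plot one fitted line
# per subject within each treatment. Subject IDs 1-20 repeat across the
# two treatment groups, so the line grouping must combine both.
BPRS$fitted <- fitted(BPRS_3)
ggplot(BPRS, aes(x = week, y = fitted,
                 group = interaction(treatment, subject),
                 color = treatment)) +
  geom_line() +
  scale_x_continuous(name = "week", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "Fitted bprs") +
  theme(legend.position = "top")
```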